[Day 24] Some preparation for fine-tuning: the Whisper model on huggingface

15th Ironman Contest huggingface fine-tuning

leo271828

2023-10-09 23:59:53

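As mentioned in the previous post, huggingface hosts plenty of models you can download and play with.
Let's actually try one out, using openai/whisper-small from huggingface as our main model.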

Model detail

I don't quite remember whether I've posted this before, but here is how the Whisper model sizes differ:

Model	Layers	Width	Heads	Parameters
Tiny	4	384	6	39M
Base	6	512	8	74M
Small	12	768	12	244M
Medium	24	1024	16	769M
Large	32	1280	20	1550M

Source: the Whisper paper

Usage

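Here I'd like to briefly mention WhisperProcessor, whose documentation talks about something called special_token.
As I understand it, these are simply special tokens that carry control information rather than transcribed text.

Anything wrapped in <| and |> should be one of these special tokens.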
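
To make the <| ... |> convention concrete, here is a minimal sketch in plain Python that separates such markers from ordinary text. It does not call the transformers library; the helper name split_special_tokens and the sample string are my own, written by hand for illustration, though the token names in it (<|startoftranscript|>, <|zh|>, <|transcribe|>, <|notimestamps|>, <|endoftext|>) are ones Whisper uses.

```python
import re

# Matches anything of the form <|...|>, e.g. <|startoftranscript|>
SPECIAL_TOKEN_PATTERN = re.compile(r"<\|[^|]*\|>")


def split_special_tokens(text):
    """Return (special_tokens, plain_text) for a decoded string."""
    special = SPECIAL_TOKEN_PATTERN.findall(text)
    plain = SPECIAL_TOKEN_PATTERN.sub("", text).strip()
    return special, plain


# Hypothetical decoded output, written by hand (not produced by a model)
decoded = "<|startoftranscript|><|zh|><|transcribe|><|notimestamps|> hello<|endoftext|>"
special, plain = split_special_tokens(decoded)
print(special)  # the markers wrapped in <| and |>
print(plain)    # the remaining ordinary text
```

In practice the processor handles these tokens for you; the sketch is only meant to show what the markers look like in a decoded string.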

OK, let's set up the environment first.

!pip install datasets
!pip install git+https://github.com/huggingface/transformers
!pip install librosa

For transformers, a plain pip install transformers seems to work too.

Next, import the packages.

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import Audio, load_dataset

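This example uses the whisper-small model; if you want something stronger you can switch to medium.
But then you need to keep an eye on your GPU specs, otherwise you may run into problems later on, or it may simply fall back to the CPU.

We'll play it safe and stick with small.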
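
For a rough feel of why the GPU matters, the sketch below estimates the memory needed just to hold the weights, using the parameter counts from the table in the Model detail section. The 4 bytes (fp32) and 2 bytes (fp16) per parameter are my assumptions, and the helper name weight_memory_gb is made up for this post. Actual usage during inference or fine-tuning is higher, since activations, gradients and optimizer state are not counted, so treat these numbers as a lower bound.

```python
# Parameter counts in millions, copied from the model size table
WHISPER_PARAMS_M = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}


def weight_memory_gb(params_millions, bytes_per_param=4):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params_millions * 1e6 * bytes_per_param / 1e9


for name, params in WHISPER_PARAMS_M.items():
    fp32 = weight_memory_gb(params)       # assumed 4 bytes per parameter
    fp16 = weight_memory_gb(params, 2)    # assumed 2 bytes per parameter
    print(f"{name:>6}: fp32 ~{fp32:.2f} GB, fp16 ~{fp16:.2f} GB")
```

By this estimate the medium weights alone take roughly three times the memory of small, which is why switching up a size can push a modest GPU over its limit.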

# load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
# Note: task="translate" makes the model output English; use task="transcribe" to get Chinese text
forced_decoder_ids = processor.get_decoder_prompt_ids(language="chinese", task="translate")

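The official docs use French for the test; we can simply switch it to Chinese.

Then load the dataset. With streaming=True, samples are fetched on demand instead of the whole dataset being downloaded to local disk.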

# load streaming dataset and read first audio sample
ds = load_dataset("common_voice", "zh-TW", split="test", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

I thought this would go smoothly, but then things broke.

input_speech = next(iter(ds))["audio"]
input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features

On Windows, running this snippet raised two errors: ModuleNotFoundError: No module named 'soundfile' and ImportError: To support encoding audio data, please install 'soundfile'.
It later turned out that the package was most likely not installed properly.
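
One way to catch this kind of problem before running the pipeline is to check whether each package can be found at all. The sketch below uses only the standard library; the helper name missing_packages and the list of names are my own choices, based on the packages used in this post plus the soundfile module named in the error message.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names used in this post; soundfile is the one from the error message
required = ["datasets", "transformers", "librosa", "soundfile"]
missing = missing_packages(required)
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All packages found")
```

This only tells you whether a module can be located, not whether it works once imported, so a clean result here is a first check rather than a guarantee.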

So I decided... to leave it to tomorrow's me!!!